-
Notifications
You must be signed in to change notification settings - Fork 43
Support Unicode international characters #151
Conversation
- Punycode domains - utilize .codePointAt instead of .charCodeAt - tighten restricted codes to control characters instead of non-latin ranges
@skeggse this is going to fail linting because of the Please let me know if I missed anything, or if this is not what you had in mind. I'm happy to push updates to this PR. |
Actually, it occurs to me that I can escape that line from the linter. I'm going to push a commit for that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you please add a few tests that verify the correct byte length checking behavior for long unicode email addresses - iirc the length limits on email addresses are in numbers of octets, as captured here and here.
Can you comment on whether this line is still correct, given the check against 127
? Same for this line, this one, and this one.
Please bump the version to at least 2.3.0
. We might consider either going to 3.0.0
, or adding an allowUnicode
flag to selectively enable this (which would be more complicated). It's possible others depend on this module rejecting unicode email address, and it's also possible that they didn't consider it, and will start having issues if they accept a unicode email address, but can't handle it in their delivery mechanism.
EDIT: sorry for making these concerns kinda late in the process - they did not occur to me sooner.
test/tmp.js
Outdated
checkDNS: true | ||
}, () => { | ||
|
||
expect(internals.punyDomain).to.equal('xn--5nqv22n.xn--lhr59c'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works, but depends on the exact behavior of punycode. I'd rather not test whether punycode did the right thing, just whether validate
called punycode correctly. Instead, could you put this?
expect(punycode.toUnicode(internals.punyDomain)).to.equal('伊昭傑@郵件.商務');
Thanks for the quick review. Addressing these issues now. 👍 |
Ah, I also just noticed RFC 6532, which, in Section 3.1 discusses the appropriate normalization form for the email address. It looks like |
I missed RFC 6532... Reading through it now. |
Thanks @robhorrigan :D @skeggse re: normalization - I don't know that it is necessary, and it may cause some unintended side effects. I'm looking at Section 4 of RFC 6532. Specifically the fact that normalization can cause the length of the address to exceed requirements/overflow buffers in edge cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented on the how, not the what, as you know more than me after reading that RFC :)
@skeggse you should keep versioning under your control and not part of this PR.
lib/index.js
Outdated
|
||
// add C0 control characters | ||
|
||
for (let i = 0; i < 33; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this package only supports node 4+, you can probably use lookup.fill
to initialize it.
package.json
Outdated
@@ -18,10 +18,12 @@ | |||
"node": ">=4.0.0" | |||
}, | |||
"dependencies": { | |||
"punycode": "^2.1.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep hapi versioning scheme (2.1.x).
@@ -0,0 +1,47 @@ | |||
'use strict'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tmp.js
??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skeggse's preference is to utilize a proper internationalized domain with an active MX record. I have struck out on finding such a domain, so this approach was his backup plan. I'm guessing tmp.js
was so that it doesn't become the permanent testing solution by default. :P
lib/index.js
Outdated
@@ -2,8 +2,8 @@ | |||
|
|||
// Load modules | |||
|
|||
const Dns = require('dns'); | |||
|
|||
let Dns = require('dns'); // eslint-disable-line prefer-const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think proxyquire would allow you to keep that const.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skeggse - Do you prefer rewire
or is it worth it to utilize proxyquire
to prevent the ESLint override?
@robhorrigan and I have started on addressing the changes you requested @skeggse. Planning on having the additional tests and refactors up tomorrow AM. We're close :) |
@WesTyler after rereading Section 4 of RFC 6532, I'm not sure I agree with your comment. I read that section as saying that normalization should mitigate some of the security issues. I'm also concerned about introducing this change without also providing a warning to users reminding them to normalize Unicode emails. Thoughts? Also thanks for the input, @Marsup. I'll update the version separately. @WesTyler is right about tmp.js, though I continue to be open to a better file name. Per proxyquire, I'm not sure. I prefer it when we don't make changes to source specifically for testing. |
proxyquire acts on the require cache, it seems better than rewire for your usage. |
Ah, I think I misread the docs for proxyquire when I skimmed them. Yeah, that looks good. |
@skeggse You're absolutely right about the normalization. I misread that section and missed the key part:
Here are my action items for this PR:
|
@skeggse - need to call in the big guns on this. |
This project was originally ported from the is_email php function. I pulled nearly all the test-cases from that project. My guess is that it was only chosen because it was outside the ASCII range. |
@skeggse Ok, we are at the last change requested - normalization. It looks like Node.js is already handling UTF-8 encoded unicode characters. Thoughts?? |
JavaScript strings are UCS-2 encoded, not UTF-8, so you'll need to try a character that needs to be represented as a surrogate pair in UCS-2. |
Oh geez. It's been a long day... XD |
Ok, so, this is as close as @robhorrigan and I have been able to get for writing tests to cover normalization: > punycode.toUnicode(punycode.toASCII('man\u0303ana.com')) === 'mañana.com'
false
> punycode.toUnicode(punycode.toASCII('man\u0303ana.com'.normalize())) === 'mañana.com'
true So our thought is to pass in the email 'testing@man\u0303ana.com' in expect(punycode.toUnicode(internals.punyDomain)).to.equal('mañana.com'); Does that meet your expectations for |
Yeah that'll do for now. |
- adds UTF-16 surrogate pair characters to tests.json - adds exported `.normalize` method and associated test - adds test in `tmp.js` for punycoded normalized characters
@skeggse - it looks like the behavior for https://github.com/v8/v8/blob/master/ChangeLog#L728 Node v6.x (all existing tests pass): > '\"test\u0000\"@iana.org'.normalize()
'"test\u0000"@iana.org' Node v4.x (normalize breaks existing tests with NULL): > '\"test\u0000\"@iana.org'.normalize()
'"test' [EDIT] Working on a |
Ok, awesome! (technically there's no monkey-patching in that last commit, but it is certainly a workaround) I'll review the whole thing one more time this evening. |
Ha my bad, I guess you're right. I was thinking monkey-patch since it was checking process.version, but that doesn't mean it's modifying the existing Thanks for the patience and help with this throughout the week |
lib/index.js
Outdated
@@ -1413,5 +1426,12 @@ exports.diagnoses = internals.validate.diagnoses = (function () { | |||
|
|||
exports.normalize = internals.normalize = function (email) { | |||
|
|||
// $lab:coverage:off$ | |||
if (process.version[1] === '4' && email.match(/\0/g)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend replacing regex
with a simple indexOf
for performance reasons.
if (process.version[1] === '4' && email.indexOf('\u0000') === 0) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On it. Just haven't pushed up yet :)
Also, using email.indexOf('/u0000') >= 0
because the NUL is not necessarily at the beginning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The loop that iterates over the characters in the string won't handle surrogate pairs. Is this appropriate? It seems like it would be better to consume an entire unicode code point for each iteration of that loop. Furthermore, on the next line, the token = email[i];
won't handle surrogate pairs either. Maybe token = email.codePointAt(i)
and i += token.length
or something. Please double-check the use of i
as an iterator, including its uses inside the loop.
I added some length checks that fail, as they don't consider the octet count correctly.
Ok, I think I found a way to check for surrogate pairs. I've modified the iterator to grab the entire pair as the token and then "skip" the second octet. I'm pushing up that commit so you can check the modification, but the tests are still failing. My question now is about the test cases and the asserted results. For the surrogate pair Also, if that is the case, I need to update the way the |
It's the exact opposite, actually. The original rules about limits on the number of octets for, say, the entire local part still apply. As a result, I just realized the tests for the domain part might be wrong, though - the punycode version is what would actually be sent over the wire. |
Wait, what?? What in the world am I reading horribly wrong?? Haha |
|
Ohhhhhhh. I see now. Got it, thank you. I updated the iterator, but I still do need to account for the difference here:
|
for (let i = 0; i < emailLength; i += token.length) {
token = email.codePointAt(i); |
Is there something wrong or undesirable with the iterator update in this commit? |
If not, then I think my latest update (checking Buffer.byteLength instead of string.length) may be the final piece.
|
The method you used to pull surrogate pairs out of the string functions. That said, it doesn't add anything - it makes the code harder to read, and I'd prefer readable and concise code. It's also misleading: within the loop, we expect There are a number of lines that look like: if (emailLength === ++i || email[i] !== '\n') { This will break in my proposed iterator. I'd still prefer to move in that direction specifically because it makes |
Alright, that's fair. I'll see what I can do to clean it up though. |
Ah, crud. You're right. |
I would be on board with that if you have the time. Gotta try to get some time in with my wife and baby daughter this weekend :) |
@skeggse I had the chance to replace my for (let i = 0; i < emailLength; i += token.length) {
token = String.fromCodePoint(email.codePointAt(i)); I didn't end up needing to change any of the iterator-related code, only the logic in quote pairs that prepends Is this closer to what you were hoping to see? |
Wonderful! I'll get these out on |
Hey, no worries! I ended up having some bandwidth today and it wasn't as involved as I was worried it would be. Thanks again for all of the patience and help with this :D |
// Check if it's a neither a number nor a latin letter | ||
else if (charCode < 48 || charCode > 122 || (charCode > 57 && charCode < 65) || (charCode > 90 && charCode < 97)) { | ||
// Check if it's a neither a number nor a latin/unicode letter | ||
else if (charCode < 48 || (charCode > 122 && charCode < 192) || (charCode > 57 && charCode < 65) || (charCode > 90 && charCode < 97)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WesTyler I'm reading through some of this code as I refactor, and now I find myself asking where 192
came from. You wouldn't happen to recall, would you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I believe 192 is the beginning of the Unicode "international" character set (À).
If I remember correctly, Unicode #s 123-191 are all non-character symbols.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, ok. Thanks!
RFC references:
isemail
Addresses #17